Interactively building a topic model employing semantic similarity in a spoken dialog system

ABSTRACT

A computer-implemented method is presented for building a topic model to discover topics in a collection of documents generated by a plurality of users. The method includes extracting conversations from the collection of documents, dividing the extracted conversations into a plurality of segments, generating a topic distribution for each of the plurality of segments based on the extracted conversations and a first pre-defined prior probability distribution, and generating continuous value constructs for each of the topic distributions based on an external corpus and a second pre-defined prior probability distribution, wherein similarity is defined between the continuous value constructs.

BACKGROUND

Technical Field

The present invention relates generally to dialog systems, and more specifically, to interactively building a topic model employing semantic similarity in a dialog system.

Description of the Related Art

Spoken language understanding is a key component in human-computer conversational interaction systems. Existing spoken dialog systems operate in single-user scenarios, where a user speaks to the system and the system provides feedback in response to the user's request. Many existing spoken dialog systems are application-specific and capable of responding only to requests within limited domains. Each domain represents a single content area such as search, movie, music, restaurant, shopping, flights, etc. Limiting the number of domains generally allows spoken dialog systems to be more accurate, but requires the user to resort to different resources for different tasks.

SUMMARY

In accordance with an embodiment, a method is provided for building a topic model to discover topics in a collection of documents generated by a plurality of users. The method includes extracting conversations from the collection of documents, dividing the extracted conversations into a plurality of segments, generating a topic distribution for each of the plurality of segments based on the extracted conversations and a first pre-defined prior probability distribution, and generating continuous value constructs for each of the topic distributions based on an external corpus and a second pre-defined prior probability distribution, wherein similarity is defined between the continuous value constructs.

In accordance with another embodiment, a system is provided for building a topic model to discover topics in a collection of documents generated by a plurality of users. The system includes a memory and one or more processors in communication with the memory configured to extract conversations from the collection of documents, divide the extracted conversations into a plurality of segments, generate a topic distribution for each of the plurality of segments based on the extracted conversations and a first pre-defined prior probability distribution, and generate continuous value constructs for each of the topic distributions based on an external corpus and a second pre-defined prior probability distribution, wherein similarity is defined between the continuous value constructs.

In accordance with yet another embodiment, a non-transitory computer-readable storage medium including a computer-readable program for building a topic model to discover topics in a collection of documents generated by a plurality of users is presented. The non-transitory computer-readable storage medium performs the steps of extracting conversations from the collection of documents, dividing the extracted conversations into a plurality of segments, generating a topic distribution for each of the plurality of segments based on the extracted conversations and a first pre-defined prior probability distribution, and generating continuous value constructs for each of the topic distributions based on an external corpus and a second pre-defined prior probability distribution, wherein similarity is defined between the continuous value constructs.

It should be noted that the exemplary embodiments are described with reference to different subject-matters. In particular, some embodiments are described with reference to method type claims whereas other embodiments have been described with reference to apparatus type claims. However, a person skilled in the art will gather from the above and the following description that, unless otherwise notified, in addition to any combination of features belonging to one type of subject-matter, also any combination between features relating to different subject-matters, in particular, between features of the method type claims, and features of the apparatus type claims, is considered as to be described within this document.

These and other features and advantages will become apparent from the following detailed description of illustrative embodiments thereof, which is to be read in connection with the accompanying drawings.

BRIEF DESCRIPTION OF THE SEVERAL VIEWS OF THE DRAWINGS

The invention will provide details in the following description of preferred embodiments with reference to the following figures wherein:

FIG. 1 is an exemplary generation model, in accordance with an embodiment of the present invention;

FIG. 2 is an exemplary generation model of FIG. 1 with further explanations regarding the embedded vector space, in accordance with an embodiment of the present invention;

FIG. 3 is an exemplary diagram illustrating the differences between the present graphical model and conventional graphical models, in accordance with an embodiment of the present invention;

FIG. 4 is a block/flow diagram of an exemplary method for building a topic model, in accordance with an embodiment of the present invention;

FIG. 5 is a block/flow diagram of an exemplary method for generating and proposing a new topic model to a user relating to a new conversation, in accordance with an embodiment of the present invention;

FIG. 6 is an exemplary processing system incorporating a dialog system, in accordance with embodiments of the present invention;

FIG. 7 is a block/flow diagram of an exemplary cloud computing environment, in accordance with an embodiment of the present invention; and

FIG. 8 is a schematic diagram of exemplary abstraction model layers, in accordance with an embodiment of the present invention.

Throughout the drawings, same or similar reference numerals represent the same or similar elements.

DETAILED DESCRIPTION

Embodiments in accordance with the present invention provide methods and devices for building a topic model in a dialog system, which segments conversations, generates continuous-value constructs, estimates parameters, and assigns weights to each construct by using, e.g., Gibbs sampling.

A spoken dialog system is a computer-based machine designed to converse with a human. A dialog between the machine and the user relies on turn-taking behavior. For example, a user can ask the machine to locate an Italian restaurant in downtown. In response to the request, the machine can state it was unable to find any Italian restaurants in downtown. The user's request and the machine's act or response form one turn in the dialog. As the dialog progresses, the spoken dialog system is able to obtain the information needed to complete one or more user goals (e.g., provide the name and location of an Italian restaurant).

Conventional dialog systems are widely used in the information technology industry, especially in the form of mobile applications for wireless telephones and tablet computers. Generally, a dialog system refers to a computer-based agent having a human-centric interface for accessing, processing, managing, and delivering information. Dialog systems are also known as chat information systems, spoken dialog systems, conversational agents, chatter robots, chatterbots, chatbots, chat agents, digital personal assistants, automated online assistants, and so forth. All these terms are within the scope of the present disclosure and referred to as a “dialog system” for simplicity.

Traditionally, a dialog system interacts with its users in natural language to simulate an intelligent conversation and provide personalized assistance to the users. For example, a user can generate requests to the dialog system in the form of conversational questions, such as “Where is the nearest hotel?” or “What is the weather like in New York?” and receive corresponding answers from the dialog system in the form of audio and/or displayable messages. The users can also provide voice commands to the dialog system requesting the performance of certain functions including, for example, generating e-mails, making phone calls, searching particular information, acquiring data, navigating, requesting notifications or reminders, and so forth. These and other functionalities make dialog systems very popular because they are of great help, especially for holders of portable electronic devices such as smart phones, cellular phones, tablet computers, gaming consoles, and the like.

The exemplary embodiments of the present invention augment the dialog system experience by observing, segmenting and storing user dialog system conversation exchanged during a time interval in a space-vector representation, generating a topic distribution for each segment by generating continuous value constructs (keywords, phrases) for each topic based on analysis of stored conversations, an external corpus, and pre-defined probability distribution of each construct into continuous values from the space-vector representation, and estimating parameters of topic distribution, construct distribution, and hidden topic for generating and proposing a candidate topic to the user for new conversation by using Gibbs sampling.

It is to be understood that the present invention will be described in terms of a given illustrative architecture; however, other architectures, structures, substrate materials and process features and steps/blocks can be varied within the scope of the present invention. It should be noted that certain features cannot be shown in all figures for the sake of clarity. This is not intended to be interpreted as a limitation of any particular embodiment, or illustration, or scope of the claims.

FIG. 1 is an exemplary generation model, in accordance with an embodiment of the present invention, whereas FIG. 2 is an exemplary generation model of FIG. 1 with further explanations regarding the embedded vector space, in accordance with an embodiment of the present invention.

Accurate prediction of conversation topics can be a valuable signal for creating coherent and engaging dialog systems. Detecting conversation topics and keywords can be used to guide dialog systems towards coherent dialog.

In machine learning and natural language processing, a topic model is a type of statistical model for discovering the abstract “topics” that occur in a collection of documents. Topic modeling is a frequently used text-mining tool for discovery of hidden semantic structures in a text body. Intuitively, given that a document is about a particular topic, one would expect particular words to appear in the document more or less frequently. Topic modeling is employed to extract topics from large corpora of texts, e.g., web documents and scientific articles. A “topic” includes a cluster of words that frequently occur together. Using contextual clues, topic models can connect words with similar meanings and distinguish between uses of words with multiple meanings. A “topic model” can also be thought of as an algorithm for discovering the main themes that pervade a large and otherwise unstructured collection of documents. Topic models can organize the collection according to the discovered themes.

A topic model captures this intuition in a mathematical framework, which allows examining a set of documents and discovering, based on the statistics of the words in each, what the topics might be and what each document's balance of topics is. Latent Dirichlet Allocation (LDA) uses unsupervised learning methods, and learns the topic distributions from the data itself, by iteratively adjusting priors. Topic models are also referred to as probabilistic topic models, which refers to statistical algorithms for discovering the latent semantic structures of an extensive text body. Topic models can help to organize and offer insights to understanding collections of unstructured text.

In FIG. 1, a generation model is created as follows. Users 102, 104, 106 create documents 110. The documents 110 can be, e.g., conversations. The conversations can be between a first entity and a second entity. The first and second entities can be people. However, the documents 110 can be any type of text. The documents 110 are divided into segments. For example, a first document 112 created by the first user 102 can have a plurality of segments. A second document 114, a third document 116, and a fourth document 118 can each be divided into a plurality of segments. A first document 120 created by the second user 104 can have a plurality of segments and a first document 122 created by the third user 106 can have a plurality of segments. The segments go through a Dirichlet distribution 130 so that each segment has a topic distribution 140. Thus, element 140 is the topic distribution of each document (or segment). For example, the segments of the first document 112 of the first user 102 can have a topic distribution 142, the segments of the second document 114 of the first user 102 can have a topic distribution 144, and the segments of the first document 120 of the second user 104 can have a topic distribution 146. Therefore, each document of each user is divided into a plurality of segments and each segment from each document of all the users has a topic distribution. Stated differently, a topic distribution θ_(d) is generated for each document d for each person's segment (e.g., seasons), by using a Dirichlet distribution whose parameter is α.

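In mathematical terms:

Topic distribution for d:

$$\theta_d = \{\theta_{d,1}, \theta_{d,2}, \ldots, \theta_{d,K}\}$$

where $\theta_{d,k}$ is the probability of selecting topic k for d, and

$$\theta_d \sim \mathrm{Dirichlet}(\alpha)$$

$$p(\theta_d \mid \alpha) \propto \prod_{k=1}^{K} \theta_{d,k}^{\alpha_k - 1}$$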
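As a minimal sketch of drawing a per-segment topic distribution from a Dirichlet prior, the following Python snippet normalizes independent Gamma draws, which is a standard way to sample a Dirichlet. The number of topics (K = 4), the symmetric prior value 0.5, the seed, and the helper name `sample_dirichlet` are illustrative assumptions and are not taken from the embodiments.

```python
import random


def sample_dirichlet(alpha, rng):
    """Draw one sample from Dirichlet(alpha) by normalizing Gamma(alpha_k, 1) draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [g / total for g in draws]


rng = random.Random(0)          # fixed seed so the sketch is repeatable
K = 4                           # hypothetical number of topics
alpha = [0.5] * K               # hypothetical symmetric Dirichlet parameter
theta_d = sample_dirichlet(alpha, rng)
print(theta_d)                  # K non-negative values that sum to 1
```

Values of α below 1 tend to concentrate the mass of θ_d on a few topics, while larger values spread it more evenly.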

Embedded keywords and phrases 150 can be extracted or collected from the documents 110 and can be mapped. The mapping is of the mean of embedded words for each topic. Word embedding is the collective name for a set of language modeling and feature learning techniques in natural language processing (NLP) where words or phrases from the vocabulary are mapped to vectors of real numbers. Word embeddings are thus vector representations of a particular word. Word2Vec is one popular technique to learn word embeddings using shallow neural networks. Word2Vec is a two-layer neural net that processes text. Its input is a text corpus and its output is a set of vectors: feature vectors for words in that corpus. While Word2Vec is not a deep neural network, it turns text into a numerical form that deep nets can understand. The purpose and usefulness of Word2Vec is to group the vectors of similar words together in vector space. That is, Word2Vec detects similarities mathematically. Word2Vec creates vectors that are distributed numerical representations of word features, features such as the context of individual words. Word2Vec does so without human intervention. Given enough data, usage and contexts, Word2Vec can make highly accurate guesses about a word's meaning based on past appearances. Those guesses can be used to establish a word's association with other words (e.g., “man” is to “boy” what “woman” is to “girl”), or cluster documents and classify them by topic. The output of the Word2Vec neural net is a vocabulary in which each item has a vector attached to it, which can be fed into a deep-learning net or simply queried to detect relationships between words.

Regarding the embedded words, the exemplary methods set document d's nth embedding word to be ew_(d,n) and the hidden topic of the embedding word to be z_(d,n). z_(d,n) is generated from a categorical distribution with parameter θ_(d) and ew_(d,n) is generated from a Gaussian distribution with parameters μ_(k) (k=z_(d,n)).

In mathematical terms:

The hidden topic for embedded word of d:

$$z = \{z_{d,1}, z_{d,2}, \ldots, z_{d,N_d}\}$$

$$z_{d,n} \sim \mathrm{Categorical}(\theta_d)$$

$$p(z_{d,n} \mid \theta_d) = \theta_{d,z_{d,n}}$$

$$ew_{d,n} \sim N(\mu_{z_{d,n}})$$

$$p(ew_{d,n} \mid \mu_{z_{d,n}}) \propto \exp\left(-\tfrac{1}{2}\,(ew_{d,n} - \mu_{z_{d,n}})^{T}(ew_{d,n} - \mu_{z_{d,n}})\right)$$

The above equations can be employed to indicate to which topic each embedded word is categorized.
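The generative step for one embedded word can be sketched as follows: a hidden topic is drawn from the categorical distribution given by the segment's topic distribution, and the embedding is then drawn from a Gaussian centered on that topic's mean. The unit variance is read off the density above, which has no explicit covariance term. The two-dimensional topic means, the example topic distribution, and the function names are illustrative assumptions.

```python
import random


def sample_categorical(probs, rng):
    """Return index k with probability probs[k] (inverse-CDF sampling)."""
    u = rng.random()
    cumulative = 0.0
    for k, p in enumerate(probs):
        cumulative += p
        if u < cumulative:
            return k
    return len(probs) - 1  # guard against floating-point round-off


def generate_embedded_word(theta_d, mu, rng):
    """Draw a hidden topic z, then an embedding with mean mu[z] and unit variance."""
    z = sample_categorical(theta_d, rng)
    ew = [rng.gauss(m, 1.0) for m in mu[z]]
    return z, ew


rng = random.Random(0)
theta_d = [0.7, 0.2, 0.1]                      # hypothetical topic distribution
mu = [[0.0, 0.0], [5.0, 5.0], [-5.0, 5.0]]     # hypothetical topic means
z, ew = generate_embedded_word(theta_d, mu, rng)
print(z, ew)
```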

After similarity is defined between constructs, it is determined if the constructs are close to or far from the mean value in the topic (or theme). The constructs distribution is evaluated based on a Gaussian distribution 170 with a fixed mean and variance. The Gaussian distribution of different topic categories is also shown (162, 164, 166, 168). For example, Gaussian distribution 162 can be a continuous-value construct (keywords, phrases) distribution generated from a topic or theme derived from the second document 114 of the first user 102. Similarly, the Gaussian distribution 166 can be a continuous-value construct distribution generated from a topic or theme derived from the first document 120 of the second user 104. Therefore, a Gaussian distribution can be generated for each topic of each document of each user.

Stated differently, the parameter μ_(k) of the embedded-word distribution of topic k is drawn from a Gaussian distribution with parameters μ₀ and σ₀²I.

In mathematical terms:

$$\mu_k \sim N(\mu_0, \sigma_0^2 I)$$

$$\mu_k = \{\mu_{k,1}, \mu_{k,2}, \ldots, \mu_{k,V}\}$$

Referring back to FIG. 1, the generation model creates segments which correspond to conversation in a certain period (e.g., seasons), and a generation model is created in which each segment has a topic distribution. The parameters of the topic distribution can be generated from an appropriate prior distribution.

Each topic (or theme) generates continuous-value constructs (keywords, phrases). Similarity is defined between any constructs, and generation probability lowers if the construct is far from the mean value in the topic (or theme). The parameters of distribution of constructs are generated from an appropriate prior distribution. In other words, the generation probability depends on the relationship between the construct and the mean; that is, the generation probability is adjusted based on a distance between the continuous value constructs and a mean value of a corresponding topic distribution.

The constructs (keywords, phrases) are collected from an external corpus and encoded into continuous values using, e.g., word2vec or other embedding technologies. The corpus can be a large and unstructured set of texts or documents. Unstructured text can be information that either does not have a pre-defined data model or is not organized in a pre-defined manner.

A generation model is shown in FIGS. 1 and 2. Topic distribution is a categorical distribution (multinomial distribution when n=1) with fixed topic category numbers whose prior distribution is a Dirichlet distribution with parameter α. Also, constructs distribution for each category is a Gaussian distribution with fixed variance and means calculated by corresponding values of constructs within the same category, whose prior distribution is, for example, Gaussian distribution with mean μ₀, variance σ₀²I.
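One way to compute a topic mean from the constructs assigned to that topic, under a Gaussian prior, is the standard conjugate update for a Gaussian mean with known variance. The sketch below assumes a unit-variance likelihood and a prior with mean μ₀ and variance σ₀²I; the function name and the sample values are illustrative, and the embodiments may estimate the means differently.

```python
def posterior_topic_mean(embeddings, mu0, sigma0_sq):
    """Conjugate Gaussian update for one topic mean.

    Assumed likelihood: each embedding ~ N(mu_k, I).
    Assumed prior:      mu_k ~ N(mu0, sigma0_sq * I).
    Returns the posterior mean vector and the per-dimension posterior variance.
    """
    n = len(embeddings)
    precision = 1.0 / sigma0_sq + n            # prior precision plus one per observation
    dim = len(mu0)
    sums = [sum(e[i] for e in embeddings) for i in range(dim)]
    mean = [(mu0[i] / sigma0_sq + sums[i]) / precision for i in range(dim)]
    return mean, 1.0 / precision


# Two hypothetical 2-dimensional embeddings assigned to the same topic.
mean, var = posterior_topic_mean([[2.0, 0.0], [4.0, 0.0]], [0.0, 0.0], 1.0)
print(mean, var)
```

With few observations the result stays close to μ₀, and as more constructs are assigned to the topic it approaches their sample mean.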

Regarding how Parameters are Estimated:

After observing constructs in each segment, parameters of topic distribution, construct distribution, and hidden parameters (hidden topic for constructs) are estimated using, e.g., Gibbs sampling. Gibbs sampling or a Gibbs sampler is a Markov Chain Monte Carlo (MCMC) algorithm for obtaining a sequence of observations which are approximated from a specified multivariate probability distribution when direct sampling is difficult.

Even when the external corpus is relatively small, the distributions can still be estimated, because prior distributions are used and the construct distribution is a continuous distribution.
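A single Gibbs update of one hidden topic can be sketched as follows, assuming the topic distribution and topic means are held fixed during the step (an uncollapsed sampler, which is only one of several possible designs). The conditional probability of topic k is taken to be proportional to θ_(d,k) multiplied by the unit-variance Gaussian density of the embedding around μ_(k). All names and numeric values are illustrative assumptions.

```python
import math
import random


def topic_posterior(ew, theta_d, mu):
    """Conditional p(z = k | theta_d, ew, mu), computed in log space for stability.

    Assumes every theta_d[k] is strictly positive.
    """
    log_w = []
    for k, mu_k in enumerate(mu):
        sq_dist = sum((x - m) ** 2 for x, m in zip(ew, mu_k))
        log_w.append(math.log(theta_d[k]) - 0.5 * sq_dist)
    top = max(log_w)
    weights = [math.exp(v - top) for v in log_w]
    total = sum(weights)
    return [w / total for w in weights]


def resample_topic(ew, theta_d, mu, rng):
    """One Gibbs step: redraw the hidden topic of a single embedded word."""
    probs = topic_posterior(ew, theta_d, mu)
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]


rng = random.Random(0)
theta_d = [0.5, 0.5]                    # hypothetical topic distribution
mu = [[0.0, 1.0], [1.0, 0.0]]           # hypothetical topic means
ew = [0.9, 0.1]                         # embedding lying near the second mean
probs = topic_posterior(ew, theta_d, mu)
print(probs, resample_topic(ew, theta_d, mu, rng))
```

A full sampler would sweep this step over every embedded word and alternate it with updates of the topic distributions and topic means.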

Regarding Runtime and Dynamic Updates of Parameters:

An appropriate segment is selected and by using the segment's topic distribution, a candidate topic and construct (keyword, phrase) in the topic are generated. Because the candidate construct is a continuous value, the nearest observed value is used for a topic proposal. Therefore, the methods can output proposed topics to a user.
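The proposal step can be sketched as follows: a topic is drawn from the selected segment's topic distribution, a continuous candidate is drawn around that topic's mean, and the candidate is snapped to the nearest observed construct. Euclidean distance, the unit variance, the toy vocabulary, and the function names are illustrative assumptions.

```python
import math
import random


def nearest_construct(vector, observed):
    """Return the observed construct whose embedding is closest (Euclidean) to vector."""
    return min(observed, key=lambda name: math.dist(vector, observed[name]))


def propose(theta_d, mu, observed, rng):
    """Draw a topic, draw a continuous candidate, and map it to an observed construct."""
    k = rng.choices(range(len(theta_d)), weights=theta_d, k=1)[0]
    candidate = [rng.gauss(m, 1.0) for m in mu[k]]
    return k, nearest_construct(candidate, observed)


rng = random.Random(0)
theta_d = [0.8, 0.2]                                    # hypothetical segment topic distribution
mu = [[1.0, 0.0], [0.0, 1.0]]                           # hypothetical topic means
observed = {"pasta": [1.0, 0.0], "skiing": [0.0, 1.0]}  # hypothetical observed constructs
topic, phrase = propose(theta_d, mu, observed, rng)
print(topic, phrase)
```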

According to whether the topic is acceptable or not, parameters of the above generation model are updated so as to heighten the generation probability of the accepted construct and to lower the generation probability of the unaccepted construct. Therefore, the generation probability of each construct can be adjusted in real-time.

Moreover, the method changes the hidden topic of the construct and near constructs, and re-performs Gibbs sampling. The freshness of the topic is reflected by lowering the generation probability of old constructs. The observation probability in the generation model of the constructs can be lowered by considering a time decay.
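The text does not specify the form of the time decay. As one hypothetical choice, the sketch below weights each observed construct by an exponential decay with a chosen half-life and recomputes a topic mean as the weighted average, so that older constructs pull the mean less. The half-life, the ages, and the function names are all assumptions made for illustration.

```python
def decay_weight(age, half_life):
    """Exponential decay: the weight halves every half_life time units (assumed form)."""
    return 0.5 ** (age / half_life)


def decayed_topic_mean(embeddings, ages, half_life):
    """Weighted mean of embeddings in which older observations count less."""
    weights = [decay_weight(a, half_life) for a in ages]
    total = sum(weights)
    dim = len(embeddings[0])
    return [
        sum(w * e[i] for w, e in zip(weights, embeddings)) / total
        for i in range(dim)
    ]


# A fresh construct at 0.0 and an older construct at 3.0, one half-life old.
mean = decayed_topic_mean([[0.0], [3.0]], ages=[0.0, 10.0], half_life=10.0)
print(mean)
```

Without decay the mean of these two points would be 1.5; with the older point down-weighted, the mean moves toward the fresh construct.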

FIG. 3 is an exemplary diagram illustrating the differences between the present graphical model and conventional graphical models, in accordance with an embodiment of the present invention. In the present graphical model, words and phrases can be represented in continuous space so that similarity among those can be exploited.

LDA and other topic models are part of the larger field of probabilistic modeling. In generative probabilistic modeling, the data is treated as arising from a generative process that includes hidden variables. This generative process defines a joint probability distribution over both the observed and hidden random variables. Data analysis is performed by using that joint distribution to compute the conditional distribution of the hidden variables given the observed variables. This conditional distribution is also called the posterior distribution.

LDA falls precisely into this framework. The observed variables are the words of the documents, the hidden variables are the topic structure, and the generative process is as described herein.

Topic modeling algorithms generally fall into two categories, that is, sampling-based algorithms and variational algorithms. Sampling-based algorithms attempt to collect samples from the posterior to approximate it with an empirical distribution. The most commonly used sampling algorithm for topic modeling is Gibbs sampling, where a Markov chain is constructed, that is, a sequence of random variables, each dependent on the previous, whose limiting distribution is the posterior. The Markov chain is defined on the hidden topic variables for a particular corpus, and the algorithm is to run the chain for a long time, collect samples from the limiting distribution, and then approximate the distribution with the collected samples. Variational methods are a deterministic alternative to sampling-based algorithms. Rather than approximating the posterior with samples, variational methods posit a parameterized family of distributions over the hidden structure and then find the member of that family that is closest to the posterior. Thus, the inference problem is transformed to an optimization problem.

Regarding the LDA graphical model 330, it is assumed that topics exist outside the document vocabulary. Each topic is a distribution over a fixed vocabulary and each word is drawn from one of those topics. Additionally, each document is a random mixture of corpus-wide topics. The goal of LDA is to infer the hidden (latent) variables, that is, to compute their distribution conditioned on the documents.

The LDA graphical model 330 includes a Dirichlet parameter 332 (or proportions parameter) designated as α, a per-document topic proportions 334 designated as θ_(d), a per-word topic assignment 336 designated as z_(d,n), an observed word 338 designated as w_(d,n), a word distribution 340 designated as φ_(k), and topics 342 designated as β. Each piece of the structure is a random variable. Thus, from a collection of documents, d, the LDA graphical model 330 infers the per-word topic assignment z_(d,n), the per-document topic proportions θ_(d), and the per-corpus topic distributions β_(k). Posterior expectations are employed to perform the task at hand, for example information retrieval or document similarity.

In contrast, the graphical model 310 of the present invention includes a Dirichlet parameter 302 (or proportions parameter) designated as α, a per-document topic proportions 304 designated as θ_(d), a per-construct topic assignment 306 designated as z_(d,n), an embedded word 308 designated as ew_(d,n), and μ_(k), which is the mean of the construct distribution and is designated as 312, which involves calculation of a mean μ₀ designated as 314 and a standard deviation σ₀ designated as 316. Thus, mean and variance data is employed in the topic distribution model to adjust generation probability of constructs.

FIG. 4 is a block/flow diagram of an exemplary method for building a topic model, in accordance with an embodiment of the present invention.

At block 402, store conversations exchanged in a certain period of time as segments.

At block 404, generate a topic distribution for each of the segments, based on the stored conversations, and a first pre-defined prior probability distribution.

At block 406, generate continuous value constructs for each topic distribution, based on an external corpus and a second pre-defined prior probability distribution, wherein a similarity is defined between any constructs.

FIG. 5 is a block/flow diagram of an exemplary method for generating and proposing a new topic model to a user relating to a new conversation, in accordance with an embodiment of the present invention.

At block 502, observe, segment, and store a user dialog system conversation exchanged during a time interval in a space-vector representation.

At block 504, generate a topic distribution for each segment by generating continuous value constructs (keywords, phrases) for each topic, based on analysis of stored conversations, an external corpus and pre-defined probability distribution of each construct into continuous values from the space-vector representation.

At block 506, estimate parameters of topic distribution, construct distribution and hidden topic for generating and proposing a candidate topic to the user for new conversation using Gibbs sampling.

Therefore, in conclusion, the topic model generates continuous values whose similarity is reflected in the probabilistic model. These continuous values indicate the hidden meaning of words and phrases that are constructs of the conversation, can be used when accumulated data are relatively small, and allow recent topics to be reflected. Moreover, the difference between the exemplary embodiments of the present invention and the conventional art is that existing methods have difficulty using phrases for topic distribution because the variation of the phrases is huge, whereas, via the embedded word distribution, the exemplary methods are aware of the center of the topic and thus can find representative words of the topic.

The output of the system can thus be a recommendation or suggestion of a candidate topic which can spark conversation(s) between one or more people. The practical application relates to dialog systems where people are encouraged to converse. The elements of the methods and systems are thus integrated into the practical application of a dialog system to trigger and maintain conversations. The improvement involves predicting relevant or pertinent or suitable topics of interest from existing topic distributions or conversation topics. The candidate topics can be continuously updated or adjusted based on mean values of the inputs received by the system in real-time. As conversations or conversation topics between people change or evolve, the mean and variance of the received constructs affect what candidate topics to recommend or propose or suggest to propagate meaningful conversations. The predictions of topics can evolve as the conversations evolve into various topic areas.

The present invention is applicable to a wide variety of dialog system modalities, both input and output, capable of responding to conversational inputs such as, but not limited to, speech, writing (e.g., text or handwriting), touch, gesture, and combinations thereof (e.g., multi-mode systems) addressed to the computer or another human in a multi-user conversation. For example, the dialog system can be responsive to an instant messaging conversation (i.e., text-based conversational inputs) or a voice conversation (i.e., speech-based conversational inputs) between users that includes conversational inputs that can be expressly or implicitly addressed to the dialog system. Embodiments generally described in the context of a modality-specific dialog system (e.g., a spoken dialog system) are merely illustrative of one suitable implementation and should not be construed as limiting the scope to any particular modality or modalities or a single modality and should be read broadly to encompass other modalities or inputs along with the corresponding hardware and/or software modifications to implement other modalities.

FIG. 6 is an exemplary processing system incorporating a dialog system, in accordance with embodiments of the present invention.

The processing system includes at least one processor (CPU) 604 operatively coupled to other components via a system bus 602. A cache 606, a Read Only Memory (ROM) 608, a Random Access Memory (RAM) 610, an input/output (I/O) adapter 620, a network adapter 630, a user interface adapter 640, and a display adapter 650, are operatively coupled to the system bus 602. Additionally, a dialog system platform 660 can communicate through the system bus 602. Moreover, a generation model can operate via the system bus 602.

A storage device 622 is operatively coupled to system bus 602 by the I/O adapter 620. The storage device 622 can be any of a disk storage device (e.g., a magnetic or optical disk storage device), a solid state magnetic device, and so forth.

A transceiver 632 is operatively coupled to system bus 602 by network adapter 630.

User input devices 642 are operatively coupled to system bus 602 by user interface adapter 640. The user input devices 642 can be any of a keyboard, a mouse, a keypad, an image capture device, a motion sensing device, a microphone, a device incorporating the functionality of at least two of the preceding devices, and so forth. Of course, other types of input devices can also be used, while maintaining the spirit of the present invention. The user input devices 642 can be the same type of user input device or different types of user input devices. The user input devices 642 are used to input and output information to and from the processing system.

A display device 652 is operatively coupled to system bus 602 by display adapter 650.

Of course, the processing system can also include other elements (not shown), as readily contemplated by one of skill in the art, as well as omit certain elements. For example, various other input devices and/or output devices can be included in the system, depending upon the particular implementation of the same, as readily understood by one of ordinary skill in the art. For example, various types of wireless and/or wired input and/or output devices can be used. Moreover, additional processors, controllers, memories, and so forth, in various configurations can also be utilized as readily appreciated by one of ordinary skill in the art. These and other variations of the processing system are readily contemplated by one of ordinary skill in the art given the teachings of the present invention provided herein.

FIG. 7 is a block/flow diagram of an exemplary cloud computing environment, in accordance with an embodiment of the present invention.

It is to be understood that although this invention includes a detailed description on cloud computing, implementation of the teachings recited herein is not limited to a cloud computing environment. Rather, embodiments of the present invention are capable of being implemented in conjunction with any other type of computing environment now known or later developed.

Cloud computing is a model of service delivery for enabling convenient, on-demand network access to a shared pool of configurable computing resources (e.g., networks, network bandwidth, servers, processing, memory, storage, applications, virtual machines, and services) that can be rapidly provisioned and released with minimal management effort or interaction with a provider of the service. This cloud model can include at least five characteristics, at least three service models, and at least four deployment models.

Characteristics are as follows:

On-demand self-service: a cloud consumer can unilaterally provision computing capabilities, such as server time and network storage, as needed automatically without requiring human interaction with the service's provider.

Broad network access: capabilities are available over a network and accessed through standard mechanisms that promote use by heterogeneous thin or thick client platforms (e.g., mobile phones, laptops, and PDAs).

Resource pooling: the provider's computing resources are pooled to serve multiple consumers using a multi-tenant model, with different physical and virtual resources dynamically assigned and reassigned according to demand. There is a sense of location independence in that the consumer generally has no control or knowledge over the exact location of the provided resources but can be able to specify location at a higher level of abstraction (e.g., country, state, or datacenter).

Rapid elasticity: capabilities can be rapidly and elastically provisioned, in some cases automatically, to quickly scale out and rapidly released to quickly scale in. To the consumer, the capabilities available for provisioning often appear to be unlimited and can be purchased in any quantity at any time.

Measured service: cloud systems automatically control and optimize resource use by leveraging a metering capability at some level of abstraction appropriate to the type of service (e.g., storage, processing, bandwidth, and active user accounts). Resource usage can be monitored, controlled, and reported, providing transparency for both the provider and consumer of the utilized service.

Service Models are as follows:

Software as a Service (SaaS): the capability provided to the consumer is to use the provider's applications running on a cloud infrastructure. The applications are accessible from various client devices through a thin client interface such as a web browser (e.g., web-based e-mail). The consumer does not manage or control the underlying cloud infrastructure including network, servers, operating systems, storage, or even individual application capabilities, with the possible exception of limited user-specific application configuration settings.

Platform as a Service (PaaS): the capability provided to the consumer is to deploy onto the cloud infrastructure consumer-created or acquired applications created using programming languages and tools supported by the provider. The consumer does not manage or control the underlying cloud infrastructure including networks, servers, operating systems, or storage, but has control over the deployed applications and possibly application hosting environment configurations.

Infrastructure as a Service (IaaS): the capability provided to the consumer is to provision processing, storage, networks, and other fundamental computing resources where the consumer is able to deploy and run arbitrary software, which can include operating systems and applications. The consumer does not manage or control the underlying cloud infrastructure but has control over operating systems, storage, deployed applications, and possibly limited control of select networking components (e.g., host firewalls).

Deployment Models are as follows:

Private cloud: the cloud infrastructure is operated solely for an organization. It can be managed by the organization or a third party and can exist on-premises or off-premises.

Community cloud: the cloud infrastructure is shared by several organizations and supports a specific community that has shared concerns (e.g., mission, security requirements, policy, and compliance considerations). It can be managed by the organizations or a third party and can exist on-premises or off-premises.

Public cloud: the cloud infrastructure is made available to the general public or a large industry group and is owned by an organization selling cloud services.

Hybrid cloud: the cloud infrastructure is a composition of two or more clouds (private, community, or public) that remain unique entities but are bound together by standardized or proprietary technology that enables data and application portability (e.g., cloud bursting for load-balancing between clouds).

A cloud computing environment is service oriented with a focus on statelessness, low coupling, modularity, and semantic interoperability. At the heart of cloud computing is an infrastructure that includes a network of interconnected nodes.

Referring now to FIG. 7, illustrative cloud computing environment 750 is depicted for enabling use cases of the present invention. As shown, cloud computing environment 750 includes one or more cloud computing nodes 710 with which local computing devices used by cloud consumers, such as, for example, personal digital assistant (PDA) or cellular telephone 754A, desktop computer 754B, laptop computer 754C, and/or automobile computer system 754N can communicate. Nodes 710 can communicate with one another. They can be grouped (not shown) physically or virtually, in one or more networks, such as Private, Community, Public, or Hybrid clouds as described hereinabove, or a combination thereof. This allows cloud computing environment 750 to offer infrastructure, platforms and/or software as services for which a cloud consumer does not need to maintain resources on a local computing device. It is understood that the types of computing devices 754A-N shown in FIG. 7 are intended to be illustrative only and that computing nodes 710 and cloud computing environment 750 can communicate with any type of computerized device over any type of network and/or network addressable connection (e.g., using a web browser).

FIG. 8 is a schematic diagram of exemplary abstraction model layers, in accordance with an embodiment of the present invention. It should be understood in advance that the components, layers, and functions shown in FIG. 8 are intended to be illustrative only and embodiments of the invention are not limited thereto. As depicted, the following layers and corresponding functions are provided:

Hardware and software layer 860 includes hardware and software components. Examples of hardware components include: mainframes 861; RISC (Reduced Instruction Set Computer) architecture based servers 862; servers 863; blade servers 864; storage devices 865; and networks and networking components 866. In some embodiments, software components include network application server software 867 and database software 868.

Virtualization layer 870 provides an abstraction layer from which the following examples of virtual entities can be provided: virtual servers 871; virtual storage 872; virtual networks 873, including virtual private networks; virtual applications and operating systems 874; and virtual clients 875.

In one example, management layer 880 can provide the functions described below. Resource provisioning 881 provides dynamic procurement of computing resources and other resources that are utilized to perform tasks within the cloud computing environment. Metering and Pricing 882 provide cost tracking as resources are utilized within the cloud computing environment, and billing or invoicing for consumption of these resources. In one example, these resources can include application software licenses. Security provides identity verification for cloud consumers and tasks, as well as protection for data and other resources. User portal 883 provides access to the cloud computing environment for consumers and system administrators. Service level management 884 provides cloud computing resource allocation and management such that required service levels are met. Service Level Agreement (SLA) planning and fulfillment 885 provide pre-arrangement for, and procurement of, cloud computing resources for which a future requirement is anticipated in accordance with an SLA.

Workloads layer 890 provides examples of functionality for which the cloud computing environment can be utilized. Examples of workloads and functions which can be provided from this layer include: mapping and navigation 891; software development and lifecycle management 892; virtual classroom education delivery 893; data analytics processing 894; transaction processing 895; and a generation model 896.

As used herein, the terms “data,” “content,” “information” and similar terms can be used interchangeably to refer to data capable of being captured, transmitted, received, displayed and/or stored in accordance with various example embodiments. Thus, use of any such terms should not be taken to limit the spirit and scope of the disclosure. Further, where a computing device is described herein to receive data from another computing device, the data can be received directly from the other computing device or can be received indirectly via one or more intermediary computing devices, such as, for example, one or more servers, relays, routers, network access points, base stations, and/or the like.

To provide for interaction with a user, embodiments of the subject matter described in this specification can be implemented on a computer having a display device, e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor, for displaying information to the user and a keyboard and a pointing device, e.g., a mouse or a trackball, by which the user can provide input to the computer. Other kinds of devices can be used to provide for interaction with a user as well; for example, feedback provided to the user can be any form of sensory feedback, e.g., visual feedback, auditory feedback, or tactile feedback; and input from the user can be received in any form, including acoustic, speech, or tactile input.

The present invention can be a system, a method, and/or a computer program product. The computer program product can include a computer readable storage medium (or media) having computer readable program instructions thereon for causing a processor to carry out aspects of the present invention.

The computer readable storage medium can be a tangible device that can retain and store instructions for use by an instruction execution device. The computer readable storage medium can be, for example, but is not limited to, an electronic storage device, a magnetic storage device, an optical storage device, an electromagnetic storage device, a semiconductor storage device, or any suitable combination of the foregoing. A non-exhaustive list of more specific examples of the computer readable storage medium includes the following: a portable computer diskette, a hard disk, a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or Flash memory), a static random access memory (SRAM), a portable compact disc read-only memory (CD-ROM), a digital versatile disk (DVD), a memory stick, a floppy disk, a mechanically encoded device such as punch-cards or raised structures in a groove having instructions recorded thereon, and any suitable combination of the foregoing. A computer readable storage medium, as used herein, is not to be construed as being transitory signals per se, such as radio waves or other freely propagating electromagnetic waves, electromagnetic waves propagating through a waveguide or other transmission media (e.g., light pulses passing through a fiber-optic cable), or electrical signals transmitted through a wire.

Computer readable program instructions described herein can be downloaded to respective computing/processing devices from a computer readable storage medium or to an external computer or external storage device via a network, for example, the Internet, a local area network, a wide area network and/or a wireless network. The network can include copper transmission cables, optical transmission fibers, wireless transmission, routers, firewalls, switches, gateway computers and/or edge servers. A network adapter card or network interface in each computing/processing device receives computer readable program instructions from the network and forwards the computer readable program instructions for storage in a computer readable storage medium within the respective computing/processing device.

Computer readable program instructions for carrying out operations of the present invention can be assembler instructions, instruction-set-architecture (ISA) instructions, machine instructions, machine dependent instructions, microcode, firmware instructions, state-setting data, or either source code or object code written in any combination of one or more programming languages, including an object oriented programming language such as Smalltalk, C++ or the like, and conventional procedural programming languages, such as the “C” programming language or similar programming languages. The computer readable program instructions can execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer or entirely on the remote computer or server. In the latter scenario, the remote computer can be connected to the user's computer through any type of network, including a local area network (LAN) or a wide area network (WAN), or the connection can be made to an external computer (for example, through the Internet using an Internet Service Provider). In some embodiments, electronic circuitry including, for example, programmable logic circuitry, field-programmable gate arrays (FPGA), or programmable logic arrays (PLA) can execute the computer readable program instructions by utilizing state information of the computer readable program instructions to personalize the electronic circuitry, in order to perform aspects of the present invention.

Aspects of the present invention are described herein with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the invention. It will be understood that each block of the flowchart illustrations and/or block diagrams, and combinations of blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer readable program instructions.

These computer readable program instructions can be provided to at least one processor of a general purpose computer, special purpose computer, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions/acts specified in the flowchart and/or block diagram block or blocks or modules. These computer readable program instructions can also be stored in a computer readable storage medium that can direct a computer, a programmable data processing apparatus, and/or other devices to function in a particular manner, such that the computer readable storage medium having instructions stored therein includes an article of manufacture including instructions which implement aspects of the function/act specified in the flowchart and/or block diagram block or blocks or modules.

The computer readable program instructions can also be loaded onto a computer, other programmable data processing apparatus, or other device to cause a series of operational blocks/steps to be performed on the computer, other programmable apparatus or other device to produce a computer implemented process, such that the instructions which execute on the computer, other programmable apparatus, or other device implement the functions/acts specified in the flowchart and/or block diagram block or blocks or modules.

The flowchart and block diagrams in the Figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods, and computer program products according to various embodiments of the present invention. In this regard, each block in the flowchart or block diagrams can represent a module, segment, or portion of instructions, which includes one or more executable instructions for implementing the specified logical function(s). In some alternative implementations, the functions noted in the blocks can occur out of the order noted in the figures. For example, two blocks shown in succession can, in fact, be executed substantially concurrently, or the blocks can sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems that perform the specified functions or acts or carry out combinations of special purpose hardware and computer instructions.

Reference in the specification to “one embodiment” or “an embodiment” of the present principles, as well as other variations thereof, means that a particular feature, structure, characteristic, and so forth described in connection with the embodiment is included in at least one embodiment of the present principles. Thus, the appearances of the phrase “in one embodiment” or “in an embodiment”, as well as any other variations, appearing in various places throughout the specification are not necessarily all referring to the same embodiment.

It is to be appreciated that the use of any of the following “/”, “and/or”, and “at least one of”, for example, in the cases of “A/B”, “A and/or B” and “at least one of A and B”, is intended to encompass the selection of the first listed option (A) only, or the selection of the second listed option (B) only, or the selection of both options (A and B). As a further example, in the cases of “A, B, and/or C” and “at least one of A, B, and C”, such phrasing is intended to encompass the selection of the first listed option (A) only, or the selection of the second listed option (B) only, or the selection of the third listed option (C) only, or the selection of the first and the second listed options (A and B) only, or the selection of the first and third listed options (A and C) only, or the selection of the second and third listed options (B and C) only, or the selection of all three options (A and B and C). This can be extended, as readily apparent by one of ordinary skill in this and related arts, for as many items listed.
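As an illustration only, the reading of “at least one of” given above corresponds to enumerating every non-empty subset of the listed options. The following minimal Python sketch shows that enumeration; the function name `selections` is hypothetical and is not part of the described embodiments.

```python
from itertools import combinations


def selections(options):
    """Return every non-empty subset of the listed options, smallest first."""
    return [
        combo
        for size in range(1, len(options) + 1)
        for combo in combinations(options, size)
    ]


# "at least one of A and B" covers three selections.
print(selections(["A", "B"]))  # [('A',), ('B',), ('A', 'B')]

# "at least one of A, B, and C" covers seven selections.
print(len(selections(["A", "B", "C"])))  # 7
```

For n listed items the same enumeration yields 2^n - 1 selections, which is the sense in which the phrasing extends to as many items as are listed.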

Having described preferred embodiments of a system and method for interactively building a topic model employing semantic similarity in a dialog system (which are intended to be illustrative and not limiting), it is noted that modifications and variations can be made by persons skilled in the art in light of the above teachings. It is therefore to be understood that changes may be made in the particular embodiments described which are within the scope of the invention as outlined by the appended claims. Having thus described aspects of the invention, with the details and particularity required by the patent laws, what is claimed and desired protected by Letters Patent is set forth in the appended claims.

What is claimed is:
 1. A computer implemented method executed on a processor for building a topic model to discover topics in a collection of documents generated by a plurality of users, the method comprising steps of: extracting conversations from the collection of documents; dividing the extracted conversations into a plurality of segments; generating a topic distribution for each of the plurality of segments based on the extracted conversations and a first pre-defined prior probability distribution; generating continuous value constructs for each of the topic distributions based on an external corpus and a second pre-defined prior probability distribution, wherein similarity is defined between the continuous value constructs; and generating a Gaussian distribution for each topic distribution of each document of the collection of documents.
 2. The method of claim 1, further comprising observing the continuous value constructs in each of the plurality of segments.
 3. The method of claim 2, further comprising estimating parameters of the topic distributions, construct distributions, and hidden topics for the continuous value constructs by using Gibbs sampling.
 4. The method of claim 3, further comprising selecting an appropriate segment of the plurality of segments based on time.
 5. The method of claim 4, further comprising generating a candidate topic and constructs in the candidate topic by using the second pre-defined prior probability distribution.
 6. The method of claim 1, wherein a generation probability is adjusted based on a distance between the continuous value constructs and a mean value of a corresponding topic distribution.
 7. The method of claim 1, wherein constructs distribution for each topic category is a Gaussian distribution with fixed variance and means calculated by corresponding values of constructs within a same topic category whose prior distribution can be a Gaussian distribution with mean μ₀ and variance σ₀²I.
 8. A non-transitory computer-readable storage medium comprising a computer-readable program executed on a processor in a data processing system for building a topic model to discover topics in a collection of documents generated by a plurality of users, wherein the computer-readable program when executed on the processor causes a computer to perform the steps of: extracting conversations from the collection of documents; dividing the extracted conversations into a plurality of segments; generating a topic distribution for each of the plurality of segments based on the extracted conversations and a first pre-defined prior probability distribution; generating continuous value constructs for each of the topic distributions based on an external corpus and a second pre-defined prior probability distribution, wherein similarity is defined between the continuous value constructs; and generating a Gaussian distribution for each topic distribution of each document of the collection of documents.
 9. The non-transitory computer-readable storage medium of claim 8, wherein the continuous value constructs are observed in each of the plurality of segments.
 10. The non-transitory computer-readable storage medium of claim 9, wherein parameters of the topic distributions, construct distributions, and hidden topics for the continuous value constructs are estimated by using Gibbs sampling.
 11. The non-transitory computer-readable storage medium of claim 10, wherein an appropriate segment of the plurality of segments is selected based on time.
 12. The non-transitory computer-readable storage medium of claim 11, wherein a candidate topic and constructs in the candidate topic are generated by using the second pre-defined prior probability distribution.
 13. The non-transitory computer-readable storage medium of claim 8, wherein a generation probability is adjusted based on a distance between the continuous value constructs and a mean value of a corresponding topic distribution.
 14. The non-transitory computer-readable storage medium of claim 8, wherein constructs distribution for each topic category is a Gaussian distribution with fixed variance and means calculated by corresponding values of constructs within a same topic category whose prior distribution can be a Gaussian distribution with mean μ₀ and variance σ₀²I.
 15. A system for building a topic model to discover topics in a collection of documents generated by a plurality of users, the system comprising: a memory; and one or more processors in communication with the memory configured to: extract conversations from the collection of documents; divide the extracted conversations into a plurality of segments; generate a topic distribution for each of the plurality of segments based on the extracted conversations and a first pre-defined prior probability distribution; generate continuous value constructs for each of the topic distributions based on an external corpus and a second pre-defined prior probability distribution, wherein similarity is defined between the continuous value constructs; and generate a Gaussian distribution for each topic distribution of each document of the collection of documents.
 16. The system of claim 15, wherein the continuous value constructs are observed in each of the plurality of segments.
 17. The system of claim 16, wherein parameters of the topic distributions, construct distributions, and hidden topics for the continuous value constructs are estimated by using Gibbs sampling.
 18. The system of claim 17, wherein an appropriate segment of the plurality of segments is selected based on time.
 19. The system of claim 18, wherein a candidate topic and constructs in the candidate topic are generated by using the second pre-defined prior probability distribution.
 20. The system of claim 15, wherein a generation probability is adjusted based on a distance between the continuous value constructs and a mean value of a corresponding topic distribution.
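
By way of non-limiting illustration, the Gaussian construct distribution with fixed variance and a Gaussian prior (mean μ₀, variance σ₀²I), and a generation probability that depends on the distance between a construct and the topic mean, can be sketched as follows. The sketch assumes an isotropic fixed variance and uses the standard conjugate update for a Gaussian mean; that update, the function names `posterior_mean` and `log_generation_prob`, and all numeric values are assumptions made for illustration and are not stated in the specification.

```python
import math


def posterior_mean(vectors, mu0, sigma0_sq, sigma_sq):
    """Topic mean after observing construct vectors (assumed conjugate update).

    vectors   : construct vectors currently assigned to the topic
    mu0       : prior mean vector
    sigma0_sq : prior variance (isotropic, i.e. sigma0_sq * I)
    sigma_sq  : fixed variance of the construct distribution (isotropic)
    """
    n = len(vectors)
    precision = 1.0 / sigma0_sq + n / sigma_sq
    sums = [sum(v[d] for v in vectors) for d in range(len(mu0))]
    return [
        (mu0[d] / sigma0_sq + sums[d] / sigma_sq) / precision
        for d in range(len(mu0))
    ]


def log_generation_prob(x, mean, sigma_sq):
    """Log density of an isotropic Gaussian; falls as x moves away from mean."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, mean))
    return -0.5 * len(x) * math.log(2 * math.pi * sigma_sq) - sq_dist / (2 * sigma_sq)


# Hypothetical two-dimensional constructs assigned to one topic.
topic_vectors = [[1.0, 0.0], [3.0, 0.0]]
mean = posterior_mean(topic_vectors, mu0=[0.0, 0.0], sigma0_sq=1.0, sigma_sq=1.0)

near = log_generation_prob([1.5, 0.0], mean, sigma_sq=1.0)
far = log_generation_prob([6.0, 4.0], mean, sigma_sq=1.0)
print(mean, near > far)
```

In this sketch a construct closer to the topic mean receives a higher generation probability than a distant one, and a topic with no assigned constructs falls back to the prior mean.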